Multi-armed Bandits with Metric Switching Costs

Authors

  • Sudipto Guha
  • Kamesh Munagala
Abstract

In this paper we consider the stochastic multi-armed bandit problem with metric switching costs. Given a set of locations (arms) in a metric space, prior information about the reward available at each location, the cost of obtaining a sample/play at every location, and rules for updating the prior based on samples/plays, the task is to maximize a certain objective function subject to a distance cost of L and a cost of plays of C. This fundamental problem models several stochastic optimization problems in robot navigation, sensor networks, labor economics, etc. We consider two natural objective functions: future utilization and past utilization. We develop a common duality-based framework that provides the first O(1) approximation in the metric switching cost model, with quite small constants. Since both problems are Max-SNP hard, this result is the best possible. We also show an "adaptivity" result: there exists a policy that orders the arms and visits them in that fixed order without revisiting any arm, and this policy obtains at least an Ω(1) fraction of the reward of the fully adaptive policy. The overall technique and the ensuing structural results are of independent interest in the context of bandit problems with complicated side constraints. As a side effect, our techniques also improve the approximation ratio of the budgeted learning problem from 4 to 3 + ε.

∗Department of Computer and Information Sciences, University of Pennsylvania, Philadelphia PA 19104-6389. Email: [email protected]. Research supported in part by an Alfred P. Sloan Research Fellowship and an NSF CAREER Award CCF-0644119.
†Department of Computer Science, Duke University, Durham NC 27708-0129. Email: [email protected]. Research supported by NSF via a CAREER award and grant CNS-0540347.
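To make the problem setup concrete, the following is a minimal sketch, not the paper's algorithm: arms sit on a line (a simple metric), each play costs one unit from a play budget C, travel between arms spends a distance budget L, and a non-adaptive policy visits the arms in a fixed order without revisiting. The function name, the uniform stand-in for the prior, and the Bernoulli rewards are all illustrative assumptions.

```python
import random

def fixed_order_policy(positions, play_cost, means, L, C, plays_per_arm, rng):
    """Visit arms in order of position; spend travel budget L and play budget C."""
    reward, here = 0.0, 0.0
    for pos, mu in sorted(zip(positions, means)):
        travel = abs(pos - here)
        if travel > L:
            break                          # travel budget exhausted
        L -= travel
        here = pos
        for _ in range(plays_per_arm):
            if play_cost > C:
                break                      # play budget exhausted
            C -= play_cost
            reward += rng.random() < mu    # Bernoulli(mu) reward sample
    return reward

rng = random.Random(0)
positions = [1.0, 2.5, 4.0, 7.0]
means = [rng.random() for _ in positions]  # stand-in for prior information
print(fixed_order_policy(positions, 1.0, means, L=5.0, C=6.0,
                         plays_per_arm=3, rng=rng))
```

The paper's adaptivity result says that some policy of exactly this restricted shape (a fixed visit order, no revisits) already achieves a constant fraction of the fully adaptive optimum; the sketch only illustrates how the two budgets constrain such a policy.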


Similar articles

Multi-Armed Bandits with Metric Movement Costs

We consider the non-stochastic Multi-Armed Bandit problem in a setting where there is a fixed and known metric on the action space that determines a cost for switching between any pair of actions. The loss of the online learner has two components: the first is the usual loss of the selected actions, and the second is an additional loss due to switching between actions. Our main contribution giv...


A Faster Index Algorithm and a Computational Study for Bandits with Switching Costs

We address the intractable multi-armed bandit problem with switching costs, for which Asawa and Teneketzis introduced in [M. Asawa and D. Teneketzis. 1996. Multi-armed bandits with switching penalties. IEEE Trans. Automat. Control, 41 328–348] an index that partially characterizes optimal policies, attaching to each project state a “continuation index” (its Gittins index) and a “switching index...


Budgeted Bandit Problems with Continuous Random Costs

We study the budgeted bandit problem, where each arm is associated with both a reward and a cost. In a budgeted bandit problem, the objective is to design an arm pulling algorithm in order to maximize the total reward before the budget runs out. In this work, we study both multi-armed bandits and linear bandits, and focus on the setting with continuous random costs. We propose an upper confiden...


A Linear Programming Relaxation and a Heuristic for the Restless Bandit Problem with General Switching Costs

We extend a relaxation technique due to Bertsimas and Niño-Mora for the restless bandit problem to the case where arbitrary costs penalize switching between the bandits. We also construct a one-step lookahead policy using the solution of the relaxation. Computational experiments and a bound for approximate dynamic programming provide some empirical support for the heuristic.


Multi-armed bandits on implicit metric spaces

The multi-armed bandit (MAB) setting is a useful abstraction of many online learning tasks which focuses on the trade-off between exploration and exploitation. In this setting, an online algorithm has a fixed set of alternatives (“arms”), and in each round it selects one arm and then observes the corresponding reward. While the case of small number of arms is by now well-understood, a lot of re...




Publication date: 2009